Phrase-based Image Captioning with Hierarchical LSTM Model
Authors
Abstract
Automatic generation of captions to describe the content of an image has been gaining a lot of research interest recently, and most existing works treat the image caption as pure sequential data. Natural language, however, possesses a temporal hierarchy structure, with complex dependencies between each subsequence. In this paper, we propose a phrase-based hierarchical Long Short-Term Memory (phi-LSTM) model to generate image descriptions. In contrast to the conventional solutions that generate captions in a purely sequential manner, our proposed model decodes an image caption from phrase to sentence. It consists of a phrase decoder at the bottom hierarchy to decode noun phrases of variable length, and an abbreviated sentence decoder at the upper hierarchy to decode an abbreviated form of the image description. A complete image caption is formed by combining the generated phrases with the abbreviated sentence during the inference stage. Empirically, our proposed model shows better or competitive results on the Flickr8k, Flickr30k and MS-COCO datasets in comparison to the state-of-the-art models. We also show that our proposed model is able to generate more novel captions (not seen in the training data), which are richer in word content, on all three datasets.
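The final combination step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `<NP>` placeholder token, the `assemble_caption` helper, and the left-to-right pairing of phrases with slots are all assumptions made for illustration, and the example words are invented rather than taken from any dataset.

```python
from typing import List

# Hypothetical placeholder marking where a noun phrase belongs in the
# abbreviated sentence. The abstract does not name such a token.
PHRASE_SLOT = "<NP>"


def assemble_caption(abbreviated: List[str], phrases: List[List[str]]) -> str:
    """Fill each phrase slot in the abbreviated sentence, left to right."""
    remaining = iter(phrases)
    words: List[str] = []
    for token in abbreviated:
        if token == PHRASE_SLOT:
            phrase = next(remaining, None)
            if phrase is None:
                raise ValueError("more phrase slots than phrases")
            words.extend(phrase)
        else:
            words.append(token)
    # Every decoded phrase should have been consumed by a slot.
    if next(remaining, None) is not None:
        raise ValueError("more phrases than phrase slots")
    return " ".join(words)


# Invented example: output of an upper-level (abbreviated sentence) decoder
# and a lower-level (noun phrase) decoder.
abbreviated = ["<NP>", "is", "sitting", "on", "<NP>"]
phrases = [["a", "brown", "dog"], ["the", "green", "grass"]]
print(assemble_caption(abbreviated, phrases))
# prints "a brown dog is sitting on the green grass"
```

In this sketch the two decoders are treated as black boxes that have already produced token lists; only the merge at inference time is shown.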
Similar resources
phi-LSTM: A Phrase-Based Hierarchical LSTM Model for Image Captioning
A picture is worth a thousand words. Not until recently, however, have we seen success stories in the understanding of visual scenes: a model that is able to detect/name objects, describe their attributes, and recognize their relationships/interactions. In this paper, we propose a phrase-based hierarchical Long Short-Term Memory (phi-LSTM) model to generate image description. The proposed model ...
Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning
In this paper, a self-guiding multimodal LSTM (sg-LSTM) image captioning model is proposed to handle an uncontrolled, imbalanced real-world image-sentence dataset. We collect the FlickrNYC dataset from Flickr as our testbed, with 306,165 images, and the original text descriptions uploaded by the users are utilized as the ground truth for training. Descriptions in FlickrNYC dataset vary dramatically rang...
Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Recent progress has been made in using attention-based encoder-decoder frameworks for video captioning. However, most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). These non-visual words can be easily predicted using natural language model without considering visual...
Image Titles - Variations on Show, Attend and Tell
Inspired by recent advances in machine translation and object detection, we implement an image captioning pipeline, consisting of a Fully Convolutional Neural Network piping image features into an image-captioning LSTM, based on the popular Show, Attend, and Tell model. We implement the model in TensorFlow and recreate performance metrics reported in the paper. We identify and experiment with v...
Automated Image Captioning Using Nearest-Neighbors Approach Driven by Top-Object Detections
The significant performance gains in deep learning coupled with the exponential growth of image and video data on the Internet have resulted in the recent emergence of automated image captioning systems. Two broad paradigms have emerged in automated image captioning, i.e., generative model-based approaches and retrieval-based approaches. Although generative model-based approaches that use the r...
Journal title:
- CoRR
Volume abs/1711.05557 Issue
Pages -
Publication date 2017